In this exercise we analyse the correlation between HDI and gender (in)equality. Our results indicate a strong correlation between the main determinants of GII and HDI. We cannot prove causation, but merely suggest that gender equality should be considered when trying to improve HDI.
This is my final assignment for the University of Helsinki course Introduction to Open Data Science (IODS).
The R code used is stored and available for inspection in this GitHub repository.
Instead of following the instructions and making the code panels foldable, I decided to create my own code wrapper, with some panels foldable and others always visible. If you find my solution too hard to evaluate, I’m happy to deliver the exercise as per your instructions.
This dataset contains 155 observations and 7 variables exploring the dependency between proxy variables of human development and gender (in)equality by country. The data originates from http://hdr.undp.org/en/data. Details of the data wrangling can be found here.
HDI attempts to describe quality of life, or how developed a country is, and comprises life expectancy, education and GNI per capita.
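The way these three dimensions combine can be sketched as follows. Since its 2010 revision, the UNDP computes HDI as the geometric mean of the three dimension indices; the index values below are made up purely for illustration.

```r
# HDI as the geometric mean of its three dimension indices
# (post-2010 UNDP formulation). The input values are hypothetical.
hdi <- function(life_exp_idx, edu_idx, income_idx) {
  (life_exp_idx * edu_idx * income_idx)^(1 / 3)
}

hdi(0.90, 0.80, 0.70)  # a country strong in health, weaker in income
```

The geometric mean penalizes uneven development: a country cannot fully compensate a weak dimension with a strong one, unlike with an arithmetic mean.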
Female labour participation, education, reproductive health and parliamentary representation determine another developmental index, the Gender Inequality Index (GII).
In this exercise, I explore whether the determinants of GII correlate with HDI. My hypothesis is that countries with low GII have better HDI. It will be interesting to see which of the determinants of GII correlate most strongly with HDI.
| Variable | Description |
|---|---|
| HDI | Human development index |
| GII | Gender inequality index |
| matMort | Maternal mortality ratio |
| adolBirthRate | Adolescent birth rate |
| reprParl | Percentage of female representatives in parliament |
| eduRatio | Ratio of females to males with at least secondary education (edu2F / edu2M) |
| labRatio | Ratio of female to male labour-force participation (labF / labM) |
#Hidden
df %>% select(-country) %>% summaryKable() %>%
kable("html", align = "rrr", caption = "Study variable summary") %>%
kable_styling(bootstrap_options = c("hover", "condensed"))
| | HDI | GII | matMort | adolBirthRate | reprParl | eduRatio | labRatio |
|---|---|---|---|---|---|---|---|
| Min | 0.348 | 0.016 | 1.000 | 0.600 | 0.000 | 0.172 | 0.186 |
| 1st Q | 0.593 | 0.184 | 11.500 | 12.650 | 12.400 | 0.726 | 0.598 |
| Median | 0.733 | 0.385 | 49.000 | 33.600 | 19.300 | 0.938 | 0.753 |
| Mean | 0.704 | 0.366 | 149.084 | 47.159 | 20.912 | 0.853 | 0.707 |
| 3rd Q | 0.829 | 0.524 | 190.000 | 71.950 | 27.950 | 0.997 | 0.854 |
| Max | 0.944 | 0.744 | 1100.000 | 204.800 | 57.500 | 1.497 | 1.038 |
These values display no alarming features. Interestingly, the median of eduRatio is ~1; I would have expected it to be much smaller.
#Hidden
df %>% gather(key="var", value = "value", HDI:labRatio) %>%
mutate(var=ordered(var, names(df[2:8]))) %>% #To retain order when facetting
ggplot() +
geom_density(aes(value)) +
facet_wrap(~var, scales="free") +
labs(title = "Study variable distribution",
y = "Density", x="Value") +
theme(axis.text.x = element_text(angle = 90))
Most variables are normal enough for my purpose. Maternal mortality is a bit skewed and has a long tail. A logarithmic conversion should make it more normal-like.
#Hidden
# Save a plot of non-normally distributed variables
p1 <- df %>%
ggplot() +
geom_density(aes(matMort)) +
labs(title = "Before log-conv",
y = "Density", x="Value")
#Log conversion for non-normal values
df %<>% mutate(matMort = log10(matMort)) %>%
rename(matMort_log = matMort)
# Store a plot of new values
p2 <- df %>%
ggplot() +
geom_density(aes(matMort_log)) +
labs(title = "After log-conv",
y = "Density", x="Value")
#Show plots
multiplot(p1,p2, cols = 2)
Clearly the distribution is better after the log-conversion.
#Hidden
p <- df %>% select(-country) %>% ggpairs(
title = "Study variable overview",
upper = list(continuous = wrap("cor", size = 4, color = "black")),
lower = list(
continuous = wrap("points", alpha = .2, size = .4),
combo = wrap("facethist", bins = 20))) +
theme(axis.text.x = element_text(
angle = 90,
color = "black",
size = 7,
vjust = .5),
axis.text.y = element_text(color = "black", size = 7),
strip.text = element_text(size = 8))
p[3,1] <- p[3,1] + aes(color = "red")
p[4,1] <- p[4,1] + aes(color = "red")
p[6,1] <- p[6,1] + aes(color = "red")
p[1,3] <- p[1,3] + aes(fontface = 2)
p[1,4] <- p[1,4] + aes(fontface = 2)
p[1,6] <- p[1,6] + aes(fontface = 2)
p
This graph contains a wealth of information, and it is worth spending some time familiarizing oneself with it. From the scatterplots, we can conclude that many of the variables correlate with each other and, importantly, that no obvious non-linear relationships emerge from the data.
Both maternal mortality and adolescent birth rate correlate strongly and negatively with HDI.
EduRatio exhibits a moderate positive correlation (R = 0.68). The relationship is not purely linear: for most of the HDI spectrum there is no correlation, but below a certain threshold a positive slope is visible (i.e. HDI correlates positively with eduRatio).
The actual gender inequality index correlates moderately and negatively with the HDI (and actually, all components of HDI, not shown).
Earlier I stated that there seems to be some kind of plateau in the correlation between HDI and education ratio (i.e. after a certain threshold, the correlation between the variables fades). I guess that in countries with low education there is also high inequality, and reducing this inter-gender difference only helps so far.
#Hidden
df %<>% mutate(eduRatioT = ntile(df$eduRatio, 5))
p1 <- ggplot(df) +
geom_point(aes(eduRatio, HDI, color = as.factor(eduRatioT))) +
scale_color_discrete(guide = FALSE)
df %<>% mutate(eduIneq = abs(1-eduRatio))
p2 <- ggplot(df) +
geom_point(aes(eduIneq, HDI, color = as.factor(eduRatioT))) +
scale_color_discrete(name = "EduRatio \n quintile") +
theme(legend.position = c(.9, .85))
multiplot(p1, p2, cols = 2)
Based on this plot, my initial assumption wasn’t entirely correct. Yes, HDI increases with eduRatio, but it seems to peak when there is no difference between the genders and to decrease after that.
We start by analyzing the proposed correlations between HDI and maternal mortality, adolescent birth rate, and education ratio.
#Hidden
# New model
model <- lm(HDI ~ matMort_log + adolBirthRate + eduRatio,
data = df)
## Regression plots
par(mfrow = c(2,2), oma = c(0, 0, 2, 0),
mar = c(2.5,3,2,0.5),
mgp = c(1.5,.5,0))
plot(model, which = c(1,2), add.smooth = T)
norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Computed the standardized residuals the long way for fun. The following
# line can be used to check the result:
# all.equal(norm.res, rstandard(model))
# Leverage is the diagonal of the hat matrix; for a multiple regression
# it is given directly by hatvalues()
leverage <- hatvalues(model)
plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")
plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")
Based on the Cook’s distance, our model is burdened by an influential outlier. The observation in question is Myanmar, which is excluded from further analysis.
#Hidden
# Remove the outlier
df <- df[-tail(order(cooks.distance(model)),1),]
# New model
model <- lm(HDI ~ matMort_log + adolBirthRate + eduRatio,
data = df)
## Regression plots
par(mfrow = c(2,2), oma = c(0, 0, 2, 0),
mar = c(2.5,3,2,0.5),
mgp = c(1.5,.5,0))
plot(model, which = c(1,2), add.smooth = T)
norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Computed the standardized residuals the long way for fun. The following
# line can be used to check the result:
# all.equal(norm.res, rstandard(model))
# Leverage is the diagonal of the hat matrix; for a multiple regression
# it is given directly by hatvalues()
leverage <- hatvalues(model)
plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")
plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")
There appears to be one more high-influence outlier. This time it is Belarus, which is excluded from further analysis.
#Hidden
# Remove the outlier
df <- df[-tail(order(cooks.distance(model)),1),]
# New model
model <- lm(HDI ~ matMort_log + adolBirthRate + eduRatio,
data = df)
## Regression plots
par(mfrow = c(2,2), oma = c(0, 0, 2, 0),
mar = c(2.5,3,2,0.5),
mgp = c(1.5,.5,0))
plot(model, which = c(1,2), add.smooth = T)
norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Computed the standardized residuals the long way for fun. The following
# line can be used to check the result:
# all.equal(norm.res, rstandard(model))
# Leverage is the diagonal of the hat matrix; for a multiple regression
# it is given directly by hatvalues()
leverage <- hatvalues(model)
plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")
plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")
According to the diagnostic plots, this model has no critical errors:
1. Residuals are the differences between the fitted and actual values. In this plot we see no clustering, or any other patterns, that could indicate problems in the model. The variance of the errors is constant.
2. The Q-Q plot demonstrates that the model performs well around the median, but error accumulates at low values.
3. In this model, no observations have high leverage.
4. No single observation affects the model too much.
summary(model)
##
## Call:
## lm(formula = HDI ~ matMort_log + adolBirthRate + eduRatio, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.161452 -0.033929 0.002323 0.035044 0.138537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.8222244 0.0273297 30.085 < 2e-16 ***
## matMort_log -0.1490740 0.0103201 -14.445 < 2e-16 ***
## adolBirthRate -0.0003674 0.0001681 -2.185 0.0304 *
## eduRatio 0.1807900 0.0212869 8.493 1.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05018 on 149 degrees of freedom
## Multiple R-squared: 0.8974, Adjusted R-squared: 0.8953
## F-statistic: 434.5 on 3 and 149 DF, p-value: < 2.2e-16
The model explains ~90% of the variance of HDI. All our explanatory variables correlate significantly with HDI, although adolescent birth rate only barely. This weak effect is probably due to collinearity between maternal mortality and adolescent birth rate.
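The collinearity suspicion can be checked with variance inflation factors. The sketch below computes VIFs by hand on simulated data so that it is self-contained; in the actual analysis one would run the same check on matMort_log and adolBirthRate in df.

```r
# Collinearity check via variance inflation factors (VIF), computed by
# hand on simulated data: x2 is built to correlate with x1, x3 is not.
set.seed(1)
n  <- 150
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)  # collinear with x1
x3 <- rnorm(n)                        # independent predictor

# VIF_j = 1 / (1 - R2_j), where R2_j regresses predictor j on the others
vif <- function(j, X) {
  d  <- as.data.frame(X)
  r2 <- summary(lm(d[[j]] ~ ., data = d[-j]))$r.squared
  1 / (1 - r2)
}

X <- cbind(x1, x2, x3)
sapply(seq_len(ncol(X)), vif, X = X)  # VIFs of x1 and x2 exceed that of x3
```

A VIF near 1 means a predictor is essentially independent of the others; values well above 1 indicate that its coefficient estimate is inflated by collinearity, which erodes the apparent significance of individually weaker predictors.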
It would be most interesting to explore why Belarus and Myanmar were such strong outliers in this model. My gut feeling is that Myanmar has fairly low rates of maternal mortality and adolescent birth rate, and that its HDI is lagging behind rapid changes in these values. For Belarus, I would suppose that it has a relatively high HDI compared to its differences in education. These are merely my personal ponderings and I will not explore these outliers more deeply.
Next I wanted to explore linear discriminant analysis (LDA), a form of dimensionality reduction.
First, the dependent variable, HDI, is divided into 4 equally sized categories. All other variables are scaled to have zero mean and unit variance.
#Hidden
# Explanatory variables are scaled
df %<>% mutate_all(funs(scale))
# sapply(df, sd)
# HDI is divided into quartiles
df$HDI <- ntile(df$HDI, 4) %>%
factor(labels = c("low", "med-lo", "med-hi", "high"))
# The data is sampled into training and testing sets
sample_ind <- sample(nrow(df), size = nrow(df) * 0.8)
train <- df[sample_ind,]
test <- df[-sample_ind,]
#data.frame(dim(train),dim(test))
#LDA
lda.fit <- lda(HDI ~ ., data = train)
lda.fit
## Call:
## lda(HDI ~ ., data = train)
##
## Prior probabilities of groups:
## low med-lo med-hi high
## 0.2786885 0.2213115 0.2704918 0.2295082
##
## Group means:
## matMort_log adolBirthRate reprParl labRatio eduIneq
## low 1.2678408 1.15206192 -0.14125200 0.3672907 1.3033668
## med-lo 0.2649995 -0.06024558 -0.08287924 -0.2742146 -0.2148443
## med-hi -0.4002100 -0.39092174 -0.21437236 -0.2049382 -0.5017317
## high -1.2842240 -0.88781541 0.44235476 0.3207093 -0.6437285
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## matMort_log 2.08351901 -1.2848210 7.566608e-01
## adolBirthRate -0.16044762 0.3834058 -7.377454e-01
## reprParl -0.07514800 0.1977839 8.649420e-01
## labRatio 0.06350302 0.6644102 -3.612129e-05
## eduIneq 0.67966255 1.1817252 -1.030491e-02
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9235 0.0734 0.0030
Almost all of the variation in the data is explained by maternal mortality and education inequality alone.
#Hidden
library(plotly, warn.conflicts = F)
# Plot the lda results
points <- data.frame(HDI = train$HDI,
lda = predict(lda.fit)$x)
levels(points$HDI) %<>% str_to_title()
arrows <- coef(lda.fit) %>%
data.frame(., label = rownames(.)) %>% arrange(desc(abs(LD1))) %>%
mutate(LD1 = LD1*2.5, LD2 = LD2*2.5, LD3 = LD3*2.5, pos = 1) %>%
rbind(., mutate(., LD1=0, LD2=0, LD3=0, pos =0))
p1 <- plot_ly(arrows, x = ~LD1, y = ~LD2, z = ~LD3,
type = "scatter3d" , color = ~label, colors = rep(rgb(0, 0, 0), 13),
opacity = .5, mode = "lines", hoverinfo = "name", showlegend = FALSE,
line = list(width = 5))
p2 <- plot_ly(points, x = ~lda.LD1, y = ~lda.LD2, z = ~lda.LD3,
type = "scatter3d" , color = ~HDI, opacity = .75, hoverinfo = "none",
mode = "markers", marker = list(size = 3, width = 2)) %>%
layout(title = "LDA",
scene = list(xaxis = list(title = "LDA1"),
yaxis = list(title = "LDA2"),
zaxis = list(title = "LDA3")))
subplot(p1, p2)
table("HDI" = test$HDI,
"Prediction" = predict(lda.fit, newdata = test)$class)
## Prediction
## HDI low med-lo med-hi high
## low 5 0 0 0
## med-lo 0 10 1 0
## med-hi 0 2 3 0
## high 0 1 2 7
The model works very well: it correctly classifies 25 of the 31 observations in the test set.
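As a quick sanity check, the accuracy can be recomputed from the confusion matrix printed above (the counts are copied from that table; note that because the train/test split is sampled without a fixed seed, these counts will vary between runs).

```r
# Test-set confusion matrix copied from the output above
conf <- matrix(c(5,  0, 0, 0,
                 0, 10, 1, 0,
                 0,  2, 3, 0,
                 0,  1, 2, 7),
               nrow = 4, byrow = TRUE,
               dimnames = list(HDI        = c("low", "med-lo", "med-hi", "high"),
                               Prediction = c("low", "med-lo", "med-hi", "high")))

# Accuracy = correctly classified (diagonal) / all test observations
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 25/31 = 0.806
```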
In this study, we explored whether the determinants of GII correlate with HDI. Our analysis concludes that education inequality and, more importantly, maternal mortality correlate strongly with HDI. With our linear regression model, we could explain nearly 90% of the variance in HDI with just maternal mortality, adolescent birth rate and education inequality. The same trend was clearly visible in the LDA.
Interestingly, we identified a couple of outliers (Belarus and Myanmar), both of which merit further study.
Based on these findings, which are in line with our current understanding of developmental drivers, improving gender equality should help raise HDI. These results should be considered when deciding on interventions for raising HDI.